Skip to content

fix bug for float duration src_audio#1421

Merged
Artiprocher merged 1 commit into
modelscope:mainfrom
mi804:ace-step1
Apr 29, 2026
Merged

fix bug for float duration src_audio#1421
Artiprocher merged 1 commit into
modelscope:mainfrom
mi804:ace-step1

Conversation

@mi804
Copy link
Copy Markdown
Collaborator

@mi804 mi804 commented Apr 29, 2026

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the AceStepPipeline to improve the handling of float durations by transitioning from floor division to rounding for frame calculations. It also introduces a logic change in the cover task to concatenate source latents when the source audio is shorter than the generated latents. Additionally, the default shift parameter in the documentation has been updated from 1.0 to 3.0. Feedback was provided to ensure consistency by using round() instead of int() across all frame conversions to prevent off-by-one errors, and to add a null check for source_latents to avoid a potential AttributeError.

Comment on lines +508 to +509
if src_latents.shape[1] > source_latents.shape[1]:
source_latents = torch.cat([source_latents, src_latents[:, source_latents.shape[1]:]], dim=1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If the cover task is initiated using audio_code_string instead of src_audio, source_latents will be None. Accessing .shape[1] on a None object will raise an AttributeError. A check should be added to ensure source_latents is not None before performing the shape comparison and concatenation.

Suggested change
if src_latents.shape[1] > source_latents.shape[1]:
source_latents = torch.cat([source_latents, src_latents[:, source_latents.shape[1]:]], dim=1)
if source_latents is not None and src_latents.shape[1] > source_latents.shape[1]:
source_latents = torch.cat([source_latents, src_latents[:, source_latents.shape[1]:]], dim=1)

Comment on lines +452 to +453
padding_frames_left = int(pad_left * pipe.vae.sampling_rate)
padding_frames_right = max(int(max_right * pipe.vae.sampling_rate) - src_audio.shape[-1], 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using int() for converting seconds to audio frames can lead to off-by-one errors due to floating-point precision (e.g., 0.1 * 48000 might result in 4799.999999999999 which int() floors to 4799). Since this PR aims to fix float duration issues, using round() is more appropriate and consistent with the changes made elsewhere in this file.

Suggested change
padding_frames_left = int(pad_left * pipe.vae.sampling_rate)
padding_frames_right = max(int(max_right * pipe.vae.sampling_rate) - src_audio.shape[-1], 0)
padding_frames_left = round(pad_left * pipe.vae.sampling_rate)
padding_frames_right = max(round(max_right * pipe.vae.sampling_rate) - src_audio.shape[-1], 0)

Comment on lines +466 to +467
start_frame = int(start * pipe.vae.sampling_rate / 1920)
end_frame = int(end * pipe.vae.sampling_rate / 1920)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with the calculation of max_latent_length (line 513), these frame indices should be calculated using round() instead of int(). This ensures that float durations are mapped to the same latent frames across different parts of the pipeline.

Suggested change
start_frame = int(start * pipe.vae.sampling_rate / 1920)
end_frame = int(end * pipe.vae.sampling_rate / 1920)
start_frame = round(start * pipe.vae.sampling_rate / 1920)
end_frame = round(end * pipe.vae.sampling_rate / 1920)

Comment on lines +470 to +471
pad_left_frames = int(padding_frames_left / 1920)
pad_right_frames = int(padding_frames_right / 1920)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using round() here is more consistent with the mapping of audio duration to latent frames used in the rest of the pipeline. This helps avoid potential artifacts at the boundaries when dealing with float durations.

Suggested change
pad_left_frames = int(padding_frames_left / 1920)
pad_right_frames = int(padding_frames_right / 1920)
pad_left_frames = round(padding_frames_left / 1920)
pad_right_frames = round(padding_frames_right / 1920)

@Artiprocher Artiprocher merged commit 534e2e4 into modelscope:main Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants